add fluentd logging driver config check #4592
Conversation
Returns an error string if the above condition is not met, or None otherwise."""
pod_name = fluentd_pods[0]["metadata"]["name"]
use_journald = self._exec_oc("exec {} /bin/printenv USE_JOURNAL".format(pod_name), [], task_vars)
@sosiouxme I decided to read the value of the env var rather than the ansible variable (`openshift_logging_fluentd_use_journal`), since I believe the ansible variable is not set by default (https://github.com/openshift/openshift-ansible/blob/master/inventory/byo/hosts.origin.example).
We have one extra problem here that just occurred to me. This check only runs against one master. It will correctly check whether the docker there has a matching journal driver, but it won't check any of the other hosts. This may be tricky to accomplish.
Hm, we may have to have a separate fluentd check that does not inherit from LoggingCheck
Force-pushed 5b998dd to 7412780
aos-ci-test
Force-pushed 75f7a0f to c34475b
@sosiouxme went ahead and moved the fluentd config check to its own file. PTAL
"""Check that Fluentd has running pods, and that its logging config matches Docker's logging config."""
self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default="logging")
fluentd_pods, error = self.get_pods_for_component(
This and other `oc` calls will only work on master, though. That's what makes this tricky. We could say to just look at the logging var to see what fluentd is supposed to have and not check what it actually has. Otherwise we're gonna have to do some really weird delegation to run `oc` against master while we're looking at everything else.
I was hesitant to rely on the logging var since its value is not set by default, but I guess for now it might be the easiest starting point for this. Will update
Force-pushed c34475b to 6b3bc00
then the value of the configured Docker logging driver should be "journald".
Otherwise, the value of the Docker logging driver should be "json-file".
Returns an error string if the above condition is not met, or None otherwise."""
use_journald = get_var(task_vars, "openshift_logging_fluentd_use_journal", default=True)
@sosiouxme removed use of `ocutil` in favor of using the value of the ansible variable. Defaulted value to `True`, making the check assume `journalctl` as the default logging driver, unless explicitly set by a user to `False`. Not entirely sure about this, however.
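The driver-matching rule described in the docstring above (journald when `use_journald` is true, json-file otherwise) can be sketched as a standalone function; the helper name and message wording here are illustrative, not the PR's actual code:

```python
def check_logging_driver(docker_logging_driver, use_journald=True):
    """Return an error string if Docker's logging driver does not match
    what Fluentd expects, or None if the configuration is consistent.

    docker_logging_driver: the driver configured on the host's Docker daemon.
    use_journald: mirrors openshift_logging_fluentd_use_journal (default True).
    """
    expected = "journald" if use_journald else "json-file"
    if docker_logging_driver != expected:
        return ('Fluentd expects the Docker logging driver to be "{}", '
                'but it is set to "{}".'.format(expected, docker_logging_driver))
    return None
```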
This is correct
May or may not have any consequence to this PR, but in any case, FWIW I see this variable was deprecated 14 days ago:
It's not just deprecated, it's obviated... with 3.6 fluentd looks at docker config to use the same driver. That makes this whole check kind of pointless for 3.6. And I'm not sure what we plan to tell customers about running checks against earlier versions?
Force-pushed 6b3bc00 to 0d803ad
aos-ci-test
Force-pushed 0d803ad to 7c0b63f
@sosiouxme switched back to using
Force-pushed a58bbba to 9bbc959
Ran into this while testing:
Unfortunately there doesn't seem to be any more helpful debug output.
But speaking of the above, why try to exec into the pod? You should have the pod definition right there to look at what the env vars are.
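Reading the env var straight out of the pod definition (instead of exec-ing into the container) could look roughly like this; `pod_env_value` is a hypothetical helper, and the pod dict follows the standard Kubernetes pod JSON shape:

```python
def pod_env_value(pod, container_name, var_name):
    """Look up an env var in a parsed pod definition (e.g. from
    `oc get pod -o json`). Returns the value, or None if not found."""
    for container in pod.get("spec", {}).get("containers", []):
        if container.get("name") != container_name:
            continue
        for env in container.get("env", []):
            if env.get("name") == var_name:
                return env.get("value")
    return None


# Minimal pod definition for illustration.
pod = {
    "spec": {
        "containers": [
            {"name": "fluentd-elasticsearch",
             "env": [{"name": "USE_JOURNAL", "value": "true"}]},
        ]
    }
}
```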
Force-pushed 148e381 to efd978b
Thanks, updated check / tests to do this instead
Tiny nits; but this is going to take me a bit longer to review properly
@@ -0,0 +1,190 @@
"""
Module for performing checks on an Fluentd logging deployment configuration
s/an/a/
running_fluentd_pods = [pod for pod in fluentd_pods if pod['status']['phase'] == 'Running']
if not len(running_fluentd_pods):
    msg = ('No Fluentd pods were found to be in the "Running" state.'
needs a space after the period
then the value of the configured Docker logging driver should be "journald".
Otherwise, the value of the Docker logging driver should be "json-file".
Returns an error string if the above condition is not met, or None otherwise."""
use_journald = get_var(task_vars, "openshift_logging_fluentd_use_journal", default=True)
Hmm, I just found some old comments I forgot to submit, maybe something is still applicable?
def run(self, tmp, task_vars):
    """Check that Fluentd has running pods, and that its logging config matches Docker's logging config."""
    self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default="logging")
I'm noticing this is perhaps unnecessarily different than the approach in https://github.com/openshift/openshift-ansible/pull/4682/files#diff-a2a36e6105b8e46e1282819c771393beR22
Better choose a single way of doing it.
Thanks, will stick to a default class field
config_error = self.check_logging_config(task_vars)
if config_error:
    msg = ("The following Fluentd logging configuration problem was found:"
           "\n-------\n"
Why a separator? We don't seem to include it in any other error message.
I think that's a legacy from when the logging check was monolithic and you could get a bunch of different msgs for the different components under one check. Dividers probably don't make sense with them all split out in separate checks.
ack, will go ahead and remove them from existing checks as well
           "{}".format(config_error))
    return {"failed": True, "changed": False, "msg": msg}

return {"failed": False, "changed": False, "msg": 'No problems found with Fluentd logging configuration.'}
How is the `"msg"` used when `"failed": False`?
It's not, except for showing up in `-v` output.
I suspect on '-v' output it will appear in the middle of a big json object?
I'd rather have less noise. `failed: False` conveys the same info.
I have also mixed feelings about the "changed" key. I don't feel confident we use it consistently and correctly, and we don't really use it in any meaningful way. I remember I wanted to "be prepared" for supporting check-mode but thus far I feel we're only carrying extra code.
You're right, I don't see a need for a msg here. No one will look at it. There was some place where it just made the logic cleaner to set one but there's no need most of the time.
"changed" response doesn't really have anything to do with check-mode. It's just a record that something was changed during this task, which gets summed up in the output at the end. Idempotency seems to be a big deal for Ansible users so reporting whether you actually changed something (e.g. installed a dependency) or not may matter to some. It does not seem critical to me but I'm not an admin. If we want a better way to track this we could set an instance variable and have the action plugin look at them to determine whether anything changed (even if an exception was thrown) and just keep all the "changed" response logic out of the actual checks.
Will remove "changed" and "msg" keys
        'from your Docker containers.\n').format(driver=logging_driver)

if error:
    error += ('To resolve this issue, add the following variable to your Ansible inventory file:\n\n'
Suggestion: add a new line `\n` at the beginning of this string here, and remove it from the end of both error messages above.
From a maintenance point of view, this makes it less likely that we introduce a new error string and have it concatenated with no space or new line in between.
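The suggested pattern can be illustrated with a small sketch, with the newline moved to the front of the appended string (the messages and inventory line here are abridged and illustrative):

```python
# Leading "\n" on the appended string guarantees a separator even if an
# earlier error message forgets its trailing newline.
error = ('Fluentd is configured to read logs from "journald", '
         'but Docker is using the "json-file" driver.')

if error:
    error += ('\nTo resolve this issue, add the following variable to your '
              'Ansible inventory file:\n\n'
              '  openshift_logging_fluentd_use_journal=False\n')
```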
Force-pushed af33581 to f422f41
return check


def canned_fluentd_pod(containers=[]):
You have a flake8 error here. Not a problem in this case, but sometimes mutable containers in arguments cause hairy problems...
Seems that we never call the function without an argument, so why not simply make it required? (s/=[]//)
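The hairy problem alluded to is Python's mutable-default-argument behavior: the default list is created once, at function definition time, and shared across every call. A quick sketch of the bug and the usual fix:

```python
# Buggy: the default list persists between calls.
def append_bad(item, bucket=[]):
    bucket.append(item)
    return bucket


# Safe: use None as the sentinel and create a fresh list per call.
def append_good(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

(In the test fixture under review, simply making the argument required, as suggested, sidesteps the issue entirely.)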
def run(self, tmp, task_vars):
    """Check that Fluentd has running pods, and that its logging config matches Docker's logging config."""

extraneous empty line
@rhcarvalho thanks for the review, comments addressed
@@ -68,7 +65,7 @@ def assert_error(error, expect_error):
    ),
])
def test_check_kibana(pods, expect_error):
-    check = canned_kibana()
+    check = kibana()
Similar to another review today, maybe this is just `Kibana()` and no need for a fixture.
components[0] = openshift_major_release_version[components[0]]

components = [int(x) for x in components[:2]]
return tuple(components)
If you want to return a tuple, no need to create a list first...

    components = tuple(int(x) for x in components[:2])
    return components
if components[0] in openshift_major_release_version:
    components[0] = openshift_major_release_version[components[0]]

components = [int(x) for x in components[:2]]
Can't `openshift_image_tag` carry a non-numeric version component, like `rcN` or `beta` or `alpha`, etc.?
not sure, cc @sosiouxme

EDIT: we currently use the value of `openshift_image_tag` in the package_version check
Yeah it can pretty much be anything. It's an opaque string.
How should we handle non-numeric version values: try/except? Or maybe fail the check altogether?
I spoke too soon. Actually `openshift_version` asserts that it looks like a version string. The requirements for origin vs enterprise are a little different, but we should be confident that it has an optional `v` and two numeric components at the beginning. There can be any kind of garbage after the base version in origin, though.
(Sorry; I spent too long looking at this at some point in the past, figured it all out, then forgot it. That said, we should possibly revisit to see if there's something better to base it on.)
Sounds good, I am only focusing on the first two numeric components of the string as it is. Not sure if there is a better var to base this on, but will take a look at sample inventory files
For origin the expression includes arbitrary letters, but the first two components should be numeric.
openshift_image_tag|match('(^v?\\d+\\.\\d+\\.\\d+(-[\\w\\-\\.]*)?$)')
We should be all set, thanks.
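Given that guarantee (optional leading `v`, two numeric components at the start, possibly followed by arbitrary garbage in origin), the major.minor extraction can be sketched with a regex; `get_major_minor_version` here is a simplified stand-in for the check's helper, not the PR's actual code:

```python
import re


def get_major_minor_version(image_tag):
    """Extract (major, minor) as ints from an openshift_image_tag value,
    tolerating an optional leading "v" and a trailing suffix like "-rc0"."""
    match = re.match(r"v?(\d+)\.(\d+)", image_tag)
    if match is None:
        raise ValueError("unrecognized version string: {}".format(image_tag))
    return int(match.group(1)), int(match.group(2))
```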
@@ -105,6 +105,28 @@ def get_var(self, *keys, **kwargs):
    raise OpenShiftCheckException("'{}' is undefined".format(".".join(map(str, keys))))
    return value

@staticmethod
def get_openshift_version(openshift_image_tag):
Maybe this should have a more specific name, since it strips out version information? It is returning MAJOR.MINOR and throwing away the rest.
Force-pushed 9e03112 to c0591d3
Some refactoring on the way that parallels some of this: #4913
Force-pushed c0591d3 to 153fbaa
aos-ci-test
logging_deployed = self.get_var("openshift_hosted_logging_deploy", default=False)

try:
    version = self.get_major_minor_version(self.get_var("openshift_image_tag"))
Note that this can raise `OpenShiftCheckException`. We should protect the action plugin call to `is_active` with a `try... except` block for cases like this, to prevent the action plugin run from blowing up. I can include that in my current work branch.
Yeah, good idea. While you're at it, probably best to add an `except Exception` to catch any dumb bug we leave in one check so the others still run.
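The defensive wrapper being discussed might look something like this sketch (the function and class names are illustrative, not the actual action plugin code):

```python
def resolve_active_checks(checks, task_vars):
    """Call each check's is_active() inside try/except so a bug in one
    check cannot stop the others from being evaluated."""
    active, errors = [], []
    for check in checks:
        try:
            if check.is_active(task_vars):
                active.append(check)
        except Exception as exc:  # deliberately broad: catch any check bug
            errors.append("{}: {}".format(check.name, exc))
    return active, errors


# Hypothetical checks for illustration.
class GoodCheck:
    name = "good"

    def is_active(self, task_vars):
        return True


class BrokenCheck:
    name = "broken"

    def is_active(self, task_vars):
        raise RuntimeError("boom")
```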
@rhcarvalho tagging this for [merge] per scrum discussion
[test]ing while waiting on the merge queue
re[test]
Force-pushed 153fbaa to 04154a2
Evaluated for openshift ansible merge up to 04154a2
Evaluated for openshift ansible test up to 04154a2
[ERROR] Status 'fedora/25/atomic' is in the 'failure' state. This status must succeed for a merge.
Wait - what? That's new.
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/391/) (Base Commit: 01e4893) (PR Branch Commit: 04154a2)
aos-ci-test
bot, retest this
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/780/) (Base Commit: beb5dea) (PR Branch Commit: 04154a2)
Related trello card: https://trello.com/c/t6H0D8Ox
Checks to see if the fluentd pod is configured to aggregate logs from `journald` or `json-file`, and verifies that the host's Docker config matches the expected logging driver.

TODO
cc @sosiouxme @rhcarvalho @brenton